An End-to-End Machine Learning Implementation for Educational Analytics
By MOHAMMAD AFROZ ALI
Aspiring SDE, AIML Intern
Final Semester B.Tech (Information Technology) - 8.0/10 CGPA
Muffakham Jah College of Engineering & Technology
Keen on Artificial Intelligence & Machine Learning
Focus on building end-to-end solutions that combine ML with software engineering best practices
Educational institutions constantly strive to understand the factors affecting student performance to enhance teaching strategies and provide targeted support. This project delivers an end-to-end machine learning solution to analyze and predict student performance based on various demographic and educational factors.
Many factors can influence a student's academic performance, from socioeconomic status to study habits. With machine learning, we can identify which factors have the strongest impact on student success and build predictive models that help educators proactively support students who may be at risk of underperforming.
The project utilizes the "Student Performance in Exams" dataset with 1000 records and the following features:
| Feature | Description | Type |
|---|---|---|
| gender | Student's gender (male/female) | Categorical |
| race_ethnicity | Ethnicity group (A through E) | Categorical |
| parental_level_of_education | Highest education level of parents | Categorical |
| lunch | Lunch type (standard or free/reduced) | Categorical |
| test_preparation_course | Whether student completed a test prep course | Categorical |
| math_score | Score in mathematics (0-100) | Numerical |
| reading_score | Score in reading (0-100) | Numerical |
| writing_score | Score in writing (0-100) | Numerical |
The primary prediction target is `math_score`; the remaining features (gender, race/ethnicity, parental level of education, lunch type, test preparation course, reading score, and writing score) serve as predictors.
The project's main engineering highlights:

- **Modular architecture:** Each component (ingestion, transformation, training, prediction) is implemented as a separate module with clear interfaces, enabling maintainability and extensibility.
- **Pipeline automation:** Complete automation from data ingestion to model deployment, with robust logging and exception handling at each stage to ensure reliability (a minimal orchestration sketch follows this list).
- **Model selection:** Evaluation of seven regression algorithms with extensive hyperparameter tuning to find the optimal model for predicting student performance.
- **Cloud deployment:** Seamless deployment to Azure Web App with containerization via Docker and CI/CD through GitHub Actions.
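Tying the components together, a training entry point might look like the sketch below. It assumes the class and method names shown in the sections that follow (`DataIngestion` itself is only implied by the module path); the repository's actual wiring may differ.

```python
# Hypothetical orchestration sketch: ingestion -> transformation -> training.
from src.components.data_ingestion import DataIngestion
from src.components.data_transformation import DataTransformation
from src.components.model_trainer import ModelTrainer

if __name__ == "__main__":
    # Each stage returns the artifacts the next stage consumes
    train_path, test_path = DataIngestion().initiate_data_ingestion()
    train_arr, test_arr, _ = DataTransformation().initiate_data_transformation(train_path, test_path)
    model_name, r2 = ModelTrainer().initiate_model_trainer(train_arr, test_arr)
    print(f"Best model: {model_name} (R^2 = {r2:.3f})")
```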
The project analyzes and predicts how student performance (test scores) is affected by factors such as gender, race/ethnicity, parental level of education, lunch type, and test preparation.
The EDA process involved analyzing a dataset of 1,000 students with 8 features to understand the relationships between different variables and their impact on student performance.
Basic statistics of numerical features:
| Statistic | Math Score | Reading Score | Writing Score |
|---|---|---|---|
| Mean | 66.09 | 69.17 | 68.05 |
| Std | 15.16 | 14.60 | 15.19 |
| Min | 0.00 | 17.00 | 10.00 |
| Max | 100.00 | 100.00 | 100.00 |
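These summary statistics can be reproduced directly with pandas, assuming the dataset path used later in the ingestion step:

```python
import pandas as pd

# Load the raw dataset and summarize the three score columns
df = pd.read_csv("notebooks/dataset/stud.csv")
print(df[["math_score", "reading_score", "writing_score"]].describe().round(2))
```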
The exploratory data analysis surfaced several key factors affecting student performance, which guided the modeling choices below.
The data ingestion process involves reading the raw dataset, performing initial processing, and splitting it into training and testing sets.
- Source module: `src/components/data_ingestion.py`
- Input dataset: `notebooks/dataset/stud.csv`
- Split strategy: `train_test_split` with `random_state=42`
- Outputs:
  - `artifacts/data.csv` - Full dataset
  - `artifacts/train.csv` - Training split
  - `artifacts/test.csv` - Testing split
```python
# Key implementation details
import os

import pandas as pd
from sklearn.model_selection import train_test_split

def initiate_data_ingestion(self):
    # Read the raw dataset
    df = pd.read_csv('notebooks/dataset/stud.csv')
    # Create the artifacts directory if it does not exist
    os.makedirs(os.path.dirname(self.ingestion_config.train_data_path), exist_ok=True)
    # Save the full dataset (artifacts/data.csv; config attribute name assumed)
    df.to_csv(self.ingestion_config.raw_data_path, index=False, header=True)
    # 80/20 split with a fixed seed for reproducibility
    train_set, test_set = train_test_split(df, test_size=0.2, random_state=42)
    # Persist the splits as CSV artifacts
    train_set.to_csv(self.ingestion_config.train_data_path, index=False, header=True)
    test_set.to_csv(self.ingestion_config.test_data_path, index=False, header=True)
    return self.ingestion_config.train_data_path, self.ingestion_config.test_data_path
```
Data validation ensures the quality and integrity of the dataset before proceeding to model training.
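The write-up does not reproduce a dedicated validation module, so the sketch below only illustrates the kinds of checks this stage implies, using the column names and score ranges from the feature table above:

```python
import pandas as pd

# Expected schema, taken from the dataset description above
EXPECTED_COLUMNS = {
    "gender", "race_ethnicity", "parental_level_of_education",
    "lunch", "test_preparation_course",
    "math_score", "reading_score", "writing_score",
}

def validate(df: pd.DataFrame) -> None:
    # Schema check: every expected column must be present
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    # Range check: all scores must lie in [0, 100]
    for col in ("math_score", "reading_score", "writing_score"):
        if not df[col].between(0, 100).all():
            raise ValueError(f"{col} contains out-of-range values")
    # Completeness check: no null values anywhere
    if df.isnull().any().any():
        raise ValueError("Dataset contains null values")
```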
The data transformation process converts raw data into a format suitable for model training, including handling categorical features and scaling numerical values.
The `DataTransformation` class in `src/components/data_transformation.py` builds a preprocessing object that scales the numerical columns and one-hot encodes the categorical ones, applies it to the train and test splits, and saves it for reuse at prediction time:
```python
# Key implementation details
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def get_data_transformer_object(self):
    # Define numerical and categorical features
    numerical_columns = ["writing_score", "reading_score"]
    categorical_columns = [
        "gender",
        "race_ethnicity",
        "parental_level_of_education",
        "lunch",
        "test_preparation_course",
    ]
    # Numerical pipeline: standardize to zero mean / unit variance
    num_pipeline = Pipeline(steps=[("scaler", StandardScaler())])
    # Categorical pipeline: one-hot encode, then scale without centering
    # (with_mean=False keeps the sparse one-hot matrix sparse)
    cat_pipeline = Pipeline(
        steps=[
            ("one_hot_encoder", OneHotEncoder()),
            ("scaler", StandardScaler(with_mean=False)),
        ]
    )
    # Combine pipelines using ColumnTransformer
    preprocessor = ColumnTransformer(
        [
            ("num_pipeline", num_pipeline, numerical_columns),
            ("cat_pipeline", cat_pipeline, categorical_columns),
        ]
    )
    return preprocessor

def initiate_data_transformation(self, train_path, test_path):
    # Load train and test data
    train_df = pd.read_csv(train_path)
    test_df = pd.read_csv(test_path)
    # Get preprocessor object
    preprocessing_obj = self.get_data_transformer_object()
    # Define target column
    target_column_name = "math_score"
    # Split into features and target
    input_feature_train_df = train_df.drop(columns=[target_column_name], axis=1)
    target_feature_train_df = train_df[target_column_name]
    input_feature_test_df = test_df.drop(columns=[target_column_name], axis=1)
    target_feature_test_df = test_df[target_column_name]
    # Fit on train only, then apply the same transform to test (avoids leakage)
    input_feature_train_arr = preprocessing_obj.fit_transform(input_feature_train_df)
    input_feature_test_arr = preprocessing_obj.transform(input_feature_test_df)
    # Re-attach the target as the last column for the trainer
    train_arr = np.c_[input_feature_train_arr, np.array(target_feature_train_df)]
    test_arr = np.c_[input_feature_test_arr, np.array(target_feature_test_df)]
    # Save the preprocessor for reuse in the prediction pipeline
    save_object(
        file_path=self.data_transformation_config.preprocessor_obj_file_path,
        obj=preprocessing_obj,
    )
    return train_arr, test_arr, self.data_transformation_config.preprocessor_obj_file_path
```
The model training stage takes a comprehensive approach to model selection, evaluating seven regression algorithms with rigorous hyperparameter tuning to find the optimal predictor for student math scores.
The seven candidates, with tuned hyperparameters matching the grids defined in the trainer code below:

- **Linear Regression:** A simple baseline model that assumes a linear relationship between features and target. No tuned parameters.
- **Random Forest:** Ensemble of decision trees that handles non-linear relationships and feature interactions well. Tuned parameters: `n_estimators`, `criterion`, `max_features`.
- **Decision Tree:** Single decision tree offering good interpretability and feature importance rankings. Tuned parameters: `criterion`, `splitter`, `max_features`.
- **Gradient Boosting:** Sequential ensemble method that builds trees to correct errors of previous trees. Tuned parameters: `learning_rate`, `subsample`, `n_estimators`.
- **XGBRegressor:** Advanced gradient boosting implementation known for high performance and speed. Tuned parameters: `learning_rate`, `n_estimators`.
- **CatBoost Regressor:** Gradient boosting algorithm that handles categorical features effectively. Tuned parameters: `depth`, `iterations`, `learning_rate`.
- **AdaBoost Regressor:** Boosting algorithm that weights poorly predicted samples more heavily in subsequent iterations. Tuned parameters: `learning_rate`, `n_estimators`.
The model training process is implemented in `src/components/model_trainer.py`. The key steps are: split the transformed arrays into features and target, evaluate every candidate with `evaluate_models` (grid search plus test-set scoring), pick the model with the highest R² score, and save it to `artifacts/model.pkl`. The `ModelTrainer` class handles model training and selection:
```python
# Key implementation details
import sys

from catboost import CatBoostRegressor
from sklearn.ensemble import (
    AdaBoostRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
)
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor

# Project helpers: exception wrapper and the utilities discussed in this article
from src.exception import CustomException
from src.utils import evaluate_models, save_object

def initiate_model_trainer(self, train_array, test_array):
    try:
        # Split arrays into features and target (target is the last column)
        X_train, y_train, X_test, y_test = (
            train_array[:, :-1],
            train_array[:, -1],
            test_array[:, :-1],
            test_array[:, -1],
        )
        # Define models to evaluate
        models = {
            "Random Forest": RandomForestRegressor(),
            "Decision Tree": DecisionTreeRegressor(),
            "Gradient Boosting": GradientBoostingRegressor(),
            "Linear Regression": LinearRegression(),
            "XGBRegressor": XGBRegressor(),
            "CatBoosting Regressor": CatBoostRegressor(verbose=False),
            "AdaBoost Regressor": AdaBoostRegressor(),
        }
        # Define hyperparameter grids
        params = {
            "Decision Tree": {
                'criterion': ['squared_error', 'friedman_mse', 'absolute_error', 'poisson'],
                'splitter': ['best', 'random'],
                'max_features': ['sqrt', 'log2'],
            },
            "Random Forest": {
                'n_estimators': [8, 16, 32, 64, 128, 256],
                'criterion': ['squared_error', 'absolute_error'],
                'max_features': ['sqrt', 'log2'],
            },
            "Gradient Boosting": {
                'learning_rate': [.1, .01, .05, .001],
                'subsample': [0.6, 0.7, 0.75, 0.8, 0.85, 0.9],
                'n_estimators': [8, 16, 32, 64, 128, 256],
            },
            "Linear Regression": {},
            "XGBRegressor": {
                'learning_rate': [.1, .01, .05, .001],
                'n_estimators': [8, 16, 32, 64, 128, 256],
            },
            "CatBoosting Regressor": {
                'depth': [6, 8, 10],
                'learning_rate': [0.01, 0.05, 0.1],
                'iterations': [30, 50, 100],
            },
            "AdaBoost Regressor": {
                'learning_rate': [.1, .01, 0.5, .001],
                'n_estimators': [8, 16, 32, 64, 128, 256],
            },
        }
        # Evaluate models via grid search (see the evaluate_models sketch below)
        model_report = evaluate_models(
            X_train=X_train,
            y_train=y_train,
            X_test=X_test,
            y_test=y_test,
            models=models,
            param=params,
        )
        # Pick the model with the highest test-set R^2
        best_model_score = max(model_report.values())
        best_model_name = list(model_report.keys())[
            list(model_report.values()).index(best_model_score)
        ]
        best_model = models[best_model_name]
        # Save the best model
        save_object(
            file_path=self.model_trainer_config.trained_model_file_path,
            obj=best_model,
        )
        return best_model_name, best_model_score
    except Exception as e:
        raise CustomException(e, sys)
```
The project uses GridSearchCV to find the optimal hyperparameters for each model:
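`evaluate_models` itself is not reproduced above; the sketch below is consistent with how it is called, but the module location (e.g. `src/utils.py`) and the CV fold count are assumptions:

```python
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV

def evaluate_models(X_train, y_train, X_test, y_test, models, param):
    # For each candidate: grid-search its hyperparameters with k-fold CV,
    # refit on the full training set, and record the test-set R^2.
    report = {}
    for name, model in models.items():
        gs = GridSearchCV(model, param[name], cv=3)
        gs.fit(X_train, y_train)
        model.set_params(**gs.best_params_)
        model.fit(X_train, y_train)
        report[name] = r2_score(y_test, model.predict(X_test))
    return report
```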
After evaluation, the models showed the following R² scores on the held-out test set (this comparison also includes Ridge, Lasso, and K-Neighbors baselines beyond the seven tuned models above):
| Model | R² Score |
|---|---|
| Ridge | 0.880593 |
| Linear Regression | 0.880345 |
| CatBoosting Regressor | 0.851632 |
| AdaBoost Regressor | 0.849847 |
| Random Forest Regressor | 0.847291 |
| Lasso | 0.825320 |
| XGBRegressor | 0.821589 |
| K-Neighbors Regressor | 0.783813 |
| Decision Tree | 0.760313 |
After evaluating all models with their optimal hyperparameters, the best performing model is selected based on the R² score on the test set. This model is then saved for deployment in the prediction pipeline.
```python
# Find best model
best_model_score = max(model_report.values())
best_model_name = list(model_report.keys())[
    list(model_report.values()).index(best_model_score)
]
best_model = models[best_model_name]

# Guard against deploying a weak model
if best_model_score < 0.6:
    raise CustomException("No best model found")
logging.info(f"Best found model on both training and testing dataset: {best_model_name}")

save_object(
    file_path=self.model_trainer_config.trained_model_file_path,
    obj=best_model,
)
```
The model evaluation process assessed each model's generalization, with the R² score on the held-out test set as the primary selection metric.
Ridge Regression achieved the highest R² Score of 0.88, indicating that it explains 88% of the variance in math scores.
Cross-validation was performed to ensure the model's robustness and generalization capability.
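As an illustration, a k-fold check of the winning model family can be run with scikit-learn; this sketch uses a simple `get_dummies` encoding as a stand-in for the project's saved preprocessor, so the exact numbers will differ:

```python
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

df = pd.read_csv("notebooks/dataset/stud.csv")
X = pd.get_dummies(df.drop(columns=["math_score"]))  # stand-in encoding
y = df["math_score"]

# 5-fold cross-validated R^2 for the best-performing model family
scores = cross_val_score(Ridge(), X, y, cv=5, scoring="r2")
print(f"R^2: {scores.mean():.3f} +/- {scores.std():.3f}")
```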
The project includes experiment tracking to monitor the training process and model performance.
The catboost_info/ directory contains logs and metrics from CatBoost training:
- `catboost_training.json`: Training parameters and configuration
- `learn_error.tsv`: Training errors across iterations
- `time_left.tsv`: Time estimation logs
- `events.out.tfevents`: TensorBoard-compatible event files

While this project doesn't explicitly implement a dedicated experiment tracking system like MLflow or Weights & Biases, it employs structured logging and artifact management to track model development:
A custom logging module records all steps in the machine learning pipeline, capturing information about data processing, model training, and evaluation metrics. This creates a historical record of the model development process.
```python
import logging
import os
from datetime import datetime

# One timestamped log file per run, stored under ./logs
LOG_FILE = f"{datetime.now().strftime('%m_%d_%Y_%H_%M_%S')}.log"
logs_dir = os.path.join(os.getcwd(), "logs")
os.makedirs(logs_dir, exist_ok=True)
LOG_FILE_PATH = os.path.join(logs_dir, LOG_FILE)

logging.basicConfig(
    filename=LOG_FILE_PATH,
    format="[ %(asctime)s ] %(lineno)d %(name)s - %(levelname)s - %(message)s",
    level=logging.INFO,
)
```
Trained models and preprocessing objects are serialized and saved in the artifacts directory. This ensures that each model version can be retrieved and compared with other versions over time.
```python
import os
import sys

import dill  # dill serializes objects that plain pickle sometimes cannot
from src.exception import CustomException

def save_object(file_path, obj):
    try:
        # Ensure the target directory exists before writing
        os.makedirs(os.path.dirname(file_path), exist_ok=True)
        with open(file_path, "wb") as file_obj:
            dill.dump(obj, file_obj)
    except Exception as e:
        raise CustomException(e, sys)
```
Model performance metrics are logged for each trained model, allowing for comparison between different algorithms and hyperparameter configurations. This information guides the selection of the best model for deployment.
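A small sketch of how such metric logging can be wired through the logging setup above; the `src.logger` module path and the helper name are assumptions:

```python
from sklearn.metrics import mean_absolute_error, r2_score

from src.logger import logging  # the custom logging module configured above

def log_metrics(name, model, X_test, y_test):
    # Record test-set metrics for one trained model in the run's log file
    preds = model.predict(X_test)
    logging.info(
        f"{name}: R2={r2_score(y_test, preds):.4f}, "
        f"MAE={mean_absolute_error(y_test, preds):.4f}"
    )
```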
The model was deployed as a web application using Flask, allowing users to input student information and receive predicted math scores.
The app.py file sets up a Flask application with the following features:
```python
# Key implementation details from app.py
from flask import Flask, request, render_template

from src.pipeline.predict_pipeline import CustomData, PredictPipeline

application = Flask(__name__)
app = application

@app.route('/')
def index():
    return render_template('index.html')

@app.route('/predictdata', methods=['GET', 'POST'])
def predict_datapoint():
    if request.method == 'GET':
        return render_template('home.html')
    else:
        # Collect form inputs into a CustomData object
        data = CustomData(
            gender=request.form.get('gender'),
            race_ethnicity=request.form.get('ethnicity'),
            parental_level_of_education=request.form.get('parental_level_of_education'),
            lunch=request.form.get('lunch'),
            test_preparation_course=request.form.get('test_preparation_course'),
            reading_score=float(request.form.get('reading_score')),
            writing_score=float(request.form.get('writing_score')),
        )
        # Build a one-row DataFrame and run it through the prediction pipeline
        pred_df = data.get_data_as_data_frame()
        predict_pipeline = PredictPipeline()
        results = predict_pipeline.predict(pred_df)
        return render_template('home.html', results=results[0])

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=80)
```
The src/pipeline/predict_pipeline.py file implements the prediction functionality:
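The file is not reproduced in full here; below is a condensed sketch of its two classes, consistent with how `app.py` uses them. The `load_object` helper (the counterpart of `save_object` shown earlier) and the `artifacts/preprocessor.pkl` path are assumptions:

```python
import pandas as pd

from src.utils import load_object  # assumed counterpart of save_object

class PredictPipeline:
    def predict(self, features):
        # Load the persisted model and preprocessor, then transform and predict
        model = load_object(file_path="artifacts/model.pkl")
        preprocessor = load_object(file_path="artifacts/preprocessor.pkl")
        scaled = preprocessor.transform(features)
        return model.predict(scaled)

class CustomData:
    def __init__(self, gender, race_ethnicity, parental_level_of_education,
                 lunch, test_preparation_course, reading_score, writing_score):
        self.gender = gender
        self.race_ethnicity = race_ethnicity
        self.parental_level_of_education = parental_level_of_education
        self.lunch = lunch
        self.test_preparation_course = test_preparation_course
        self.reading_score = reading_score
        self.writing_score = writing_score

    def get_data_as_data_frame(self):
        # One-row DataFrame matching the training feature columns
        return pd.DataFrame({
            "gender": [self.gender],
            "race_ethnicity": [self.race_ethnicity],
            "parental_level_of_education": [self.parental_level_of_education],
            "lunch": [self.lunch],
            "test_preparation_course": [self.test_preparation_course],
            "reading_score": [self.reading_score],
            "writing_score": [self.writing_score],
        })
```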
The application was containerized using Docker for consistent deployment across environments.
```dockerfile
FROM python:3.9-slim
WORKDIR /app
COPY . /app
RUN apt-get update -y && pip install -r requirements.txt
CMD ["python3", "app.py"]
```
The Dockerfile specifies:

- `python:3.9-slim` as a lightweight base image
- `/app` as the working directory, with the project code copied in
- Dependency installation from `requirements.txt`
- `app.py` as the container's entry point
The .dockerignore file excludes unnecessary files from the Docker build context:
```
venv
.git
__pycache__
*.log
```
The project implements a CI/CD pipeline using GitHub Actions to automate the build and deployment process.
The CI/CD pipeline is defined in .github/workflows/main_studentperformancecheck.yml and includes the following steps:
```yaml
name: Build and deploy container app to Azure Web App - studentperformancecheck

on:
  push:
    branches:
      - main
  workflow_dispatch:

jobs:
  build:
    runs-on: 'ubuntu-latest'
    steps:
      - uses: actions/checkout@v2

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v2

      - name: Log in to registry
        uses: docker/login-action@v2
        with:
          registry: https://testdockerafroz.azurecr.io/
          username: ${{ secrets.AzureAppService_ContainerUsername_f24c206db1e145f79bfb160d64f62f1d }}
          password: ${{ secrets.AzureAppService_ContainerPassword_2f191264d14048f9979084413f811090 }}

      - name: Build and push container image to registry
        uses: docker/build-push-action@v3
        with:
          push: true
          tags: testdockerafroz.azurecr.io/${{ secrets.AzureAppService_ContainerUsername_f24c206db1e145f79bfb160d64f62f1d }}/studentperformance1:${{ github.sha }}
          file: ./Dockerfile

  deploy:
    runs-on: ubuntu-latest
    needs: build
    environment:
      name: 'production'
      url: ${{ steps.deploy-to-webapp.outputs.webapp-url }}
    steps:
      - name: Deploy to Azure Web App
        id: deploy-to-webapp
        uses: azure/webapps-deploy@v2
        with:
          app-name: 'studentperformancecheck'
          slot-name: 'production'
          publish-profile: ${{ secrets.AzureAppService_PublishProfile_a291bb02fa174668a5f7d1ca7a8cc164 }}
          images: 'testdockerafroz.azurecr.io/${{ secrets.AzureAppService_ContainerUsername_f24c206db1e145f79bfb160d64f62f1d }}/studentperformance1:${{ github.sha }}'
```
The application is deployed on Azure using Azure Web App and Azure Container Registry.
- **Azure Container Registry:** Hosts the Docker images built by the CI/CD pipeline, providing version control and secure storage for container images.
- **Azure Web App:** Hosts the containerized application, providing a fully managed platform for running the prediction service with automatic scaling and high availability.
The deployment architecture follows these steps:

1. A push to the main branch triggers the GitHub Actions workflow.
2. The build job builds the Docker image and pushes it to Azure Container Registry.
3. The deploy job releases the tagged image to the Azure Web App's production slot.
4. The web app serves the Flask prediction service to end users.
The Student Success Predictor project successfully demonstrates an end-to-end machine learning pipeline, from exploratory data analysis to model deployment in production.
The project provided several valuable insights: simple linear models (Ridge and Linear Regression) outperformed more complex ensembles on this dataset; reading and writing scores carry much of the signal for predicting math performance; and modular components, structured logging, and CI/CD make the jump from notebook to production far smoother.